Project Description

In this project, we will apply our statistical testing skills to analyze historical data from men's and women's international soccer matches.

We imagine ourselves working at a major online sports media company that specializes in soccer analysis and reporting. Over many years, we have followed both men's and women's international soccer matches closely. Our intuition suggests that more goals might be scored in women's international matches than in men's. This could be an interesting topic for an investigative article that our audience would enjoy. However, to be confident in this idea, we need to perform a proper statistical hypothesis test.

We understand that soccer has changed a lot over time and that performance can vary depending on the tournament. Therefore, we will focus our analysis only on official FIFA World Cup matches (excluding qualifiers) played since January 1, 2002. This will help keep our comparison fair and relevant.

We have two datasets in CSV format:

  • women_results.csv
  • men_results.csv

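These datasets were collected from a reliable online source and contain results of women's and men's international matches across several tournaments (friendlies, the Euro, the FIFA World Cup, and others), so we need to filter them down to World Cup games before testing.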

The main question we want to answer is:

Are more goals scored in women's international soccer matches than in men's?

For this analysis, we will use a significance level of 10% (0.10) and set up the hypotheses as follows:

  • Null hypothesis ($H_0$): The average number of goals scored in women's international matches is equal to that of men's matches.
  • Alternative hypothesis ($H_A$): The average number of goals scored in women's international matches is greater than that of men's matches.

With this setup, we will perform the appropriate statistical tests and interpret the results to provide clear insights.
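
In symbols, the setup above can be written as follows. The notation $\mu_W$ and $\mu_M$ (mean goals per match in women's and men's matches) and $\alpha$ (significance level) is introduced here for illustration and is not part of the original project description:

$$
H_0: \mu_W = \mu_M \qquad H_A: \mu_W > \mu_M \qquad \alpha = 0.10
$$

We reject $H_0$ when the p-value of the right-tailed test satisfies $p \le \alpha$, and fail to reject it otherwise.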

Let's get started!

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import pingouin
from scipy.stats import mannwhitneyu
In [2]:
# Load the datasets
men = pd.read_csv("men_results.csv")
women = pd.read_csv("women_results.csv")
In [3]:
men.head(3)
Out[3]:
   Unnamed: 0        date  home_team  away_team  home_score  away_score  tournament
0           0  1872-11-30   Scotland    England           0           0    Friendly
1           1  1873-03-08    England   Scotland           4           2    Friendly
2           2  1874-03-07   Scotland    England           2           1    Friendly
In [4]:
women.head(3)
Out[4]:
   Unnamed: 0        date  home_team  away_team  home_score  away_score  tournament
0           0  1969-11-01      Italy     France           1           0        Euro
1           1  1969-11-01    Denmark    England           4           3        Euro
2           2  1969-11-02    England     France           2           0        Euro
In [6]:
# Drop the leftover index column written out with the CSV
men.drop("Unnamed: 0", axis=1, inplace=True)
women.drop("Unnamed: 0", axis=1, inplace=True)
In [11]:
# Keep only FIFA World Cup matches played on or after 2002-01-01
men["date"] = pd.to_datetime(men["date"])
men_subset = men[(men["date"] >= "2002-01-01") & (men["tournament"] == "FIFA World Cup")].copy()

women["date"] = pd.to_datetime(women["date"])
women_subset = women[(women["date"] >= "2002-01-01") & (women["tournament"] == "FIFA World Cup")].copy()
In [12]:
# Create group and goals_scored columns
men_subset["group"] = "men"
women_subset["group"] = "women"
men_subset["goals_scored"] = men_subset["home_score"] + men_subset["away_score"]
women_subset["goals_scored"] = women_subset["home_score"] + women_subset["away_score"]
In [13]:
# Determine normality using histograms
women_subset["goals_scored"].hist()
plt.title("Women Goals Scored Distribution")
plt.show()
plt.clf()

men_subset["goals_scored"].hist()
plt.title("Men Goals Scored Distribution")
plt.show()
plt.clf()
[Figure: histogram, "Women Goals Scored Distribution"]
[Figure: histogram, "Men Goals Scored Distribution"]
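
Histograms give only a visual impression of normality. As a complement, a formal check such as the Shapiro-Wilk test could be run. The sketch below is not part of the original analysis: it uses synthetic Poisson counts as a stand-in for goals per match (an assumption made so the example runs without the CSV files), and the variable names are hypothetical.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic stand-in for goals per match (NOT the real data): Poisson counts
rng = np.random.default_rng(seed=0)
fake_goals = rng.poisson(lam=2.7, size=200)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
stat, p_value = shapiro(fake_goals)

# A small p-value is evidence against normality (10% level, as in this project)
looks_normal = p_value > 0.10

print(f"W = {stat:.3f}, p = {p_value:.4g}, looks normal: {looks_normal}")
```

With the real data, `fake_goals` would be replaced by `women_subset["goals_scored"]` or `men_subset["goals_scored"]`. If normality is doubtful, a rank-based test such as the Wilcoxon-Mann-Whitney test is a reasonable choice over a t-test.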
In [14]:
# Combine women's and men's data into a single long-format DataFrame
both = pd.concat([women_subset, men_subset], axis=0, ignore_index=True)
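
Because `pd.concat` stacks the two subsets, `both` is in long format: one row per match, with a `group` label. The following toy example (made-up numbers, hypothetical variable names) sketches how `DataFrame.pivot` turns such a frame into wide format, with one column per group and `NaN` wherever a row belongs to the other group.

```python
import pandas as pd

# Toy long-format frame with made-up values, mirroring the columns used here
toy = pd.DataFrame({
    "goals_scored": [3, 1, 4, 2],
    "group": ["women", "women", "men", "men"],
})

# One column per group; rows keep their original index, so each row
# has a value in exactly one column and NaN in the other
wide = toy.pivot(columns="group", values="goals_scored")
print(wide)

# Dropping the NaN padding recovers each group's values
women_values = wide["women"].dropna().tolist()
men_values = wide["men"].dropna().tolist()
```

The `NaN` padding is why the wide frame has as many rows as the long one; dropping it per column recovers each group's values.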
In [15]:
# Transform the data for pingouin's Mann-Whitney U test (also known as the Wilcoxon-Mann-Whitney test)
both_subset = both[["goals_scored", "group"]]
both_subset_wide = both_subset.pivot(columns="group", values="goals_scored")

# Perform right-tailed Wilcoxon-Mann-Whitney test with pingouin
results_pg = pingouin.mwu(x=both_subset_wide["women"],
                          y=both_subset_wide["men"],
                          alternative="greater")

# Alternative SciPy solution
results_scipy = mannwhitneyu(x=women_subset["goals_scored"],
                             y=men_subset["goals_scored"],
                             alternative="greater")

# Extract p-value as a float
p_val = results_pg["p-val"].values[0]
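
To build intuition for what the test measures, the U statistic for a sample can be computed by hand: it counts the pairs in which a value from the first sample exceeds a value from the second, with ties counted as half. The function below is an illustrative sketch on made-up toy samples, not the routine pingouin or SciPy use internally.

```python
def mann_whitney_u(x, y):
    """Count pairs (xi, yj) with xi > yj; a tie counts as half a pair."""
    u = 0.0
    for xi in x:
        for yj in y:
            if xi > yj:
                u += 1.0
            elif xi == yj:
                u += 0.5
    return u

# Made-up toy samples, not taken from the match data
toy_women = [3, 4, 2, 6]
toy_men = [1, 2, 3]

u_women = mann_whitney_u(toy_women, toy_men)
u_men = mann_whitney_u(toy_men, toy_women)

# The two statistics always sum to the number of pairs, len(x) * len(y)
print(u_women, u_men, len(toy_women) * len(toy_men))
```

A large U for the women's sample relative to the total number of pairs means that women's matches tend to have more goals than men's, which is what the right-tailed alternative looks for.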
In [16]:
# Determine hypothesis test result using the 10% significance level
if p_val <= 0.10:
    result = "reject"
else:
    result = "fail to reject"

result_dict = {"p_val": p_val, "result": result}

result_dict
Out[16]:
{'p_val': 0.005106609825443641, 'result': 'reject'}
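
Since the p-value is far below the 10% significance level stated in the project description, we reject the null hypothesis. The small helper below is a hypothetical convenience function (not part of the original notebook) that makes the significance level an explicit parameter, which helps show how sensitive the conclusion is to that choice.

```python
def decide(p_value, alpha=0.10):
    """Return the test decision for a given p-value and significance level."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return "reject" if p_value <= alpha else "fail to reject"

# p-value reported in Out[16] above
p_reported = 0.005106609825443641

decisions = {alpha: decide(p_reported, alpha) for alpha in (0.10, 0.05, 0.01, 0.001)}
print(decisions)
```

With the reported p-value, the null hypothesis is rejected at the 10%, 5%, and 1% levels, but not at the 0.1% level.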